[GLUTEN-10568] [VL] Pass the table schema to the HiveTableHandle#10569
[GLUTEN-10568] [VL] Pass the table schema to the HiveTableHandle#10569kevinwilfong wants to merge 1 commit intoapache:mainfrom
Conversation
|
Run Gluten Clickhouse CI on x86 |
aca8302 to
c1a3f30
Compare
c1a3f30 to
8a33c16
Compare
| splitInfos.zipWithIndex.map { | ||
| case (splitInfos, index) => | ||
| case (splits, index) => | ||
| val splitsByteArray = splits.zipWithIndex.map { |
There was a problem hiding this comment.
could we inject table schema into substrait plan?
the draft implementation will bring more GC in driver.
There was a problem hiding this comment.
When you say inject it into the substrait plan do you mean in the ReadRel?
If so definitely, this was something I was already considering I just wasn't sure how open folks are to updating the substrait protobufs, I borrowed this approach from what's already done in the code for ClickHouse.
We could probably deprecate the schema field in FileOrFiles. It looks like the only place it's used is that ClickHouse code path, and there it looks like it's put there because they only want to set it when the file is in the TextFile format, we could probably change that logic to only consume the field if it's in the TextFile format if that's not already the way it is.
There was a problem hiding this comment.
When you say inject it into the substrait plan do you mean in the ReadRel?
yes, you can try modify gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto
We could probably deprecate the schema field in FileOrFiles
+1
There was a problem hiding this comment.
@Yohahaha I see now why they originally added it at the split level for ClickHouse
They only wanted to add the table schema if it was necessary, which today is only if a file is in the TextFileFormat. Given this is a property of the partition/split we don't know this at the time we're constructing the plan.
If we always set the table schema then it's possible the table schema has some column type that Gluten doesn't support (e.g. TimestampNTZ) which causes an exception serializing the ReadRelNode.
We can optionally add it to the substrait plan for the case where we want to support index based column resolution, which can be determined at plan generation time, however, to support file formats that always depend on knowing the schema like Text files, I think we'll need to keep it at the split level.
There was a problem hiding this comment.
See the failure in the test "SPARK-36726: test incorrect Parquet row group file offset" in GlutenParquetIOSuite in gluten-ut/spark35 when I add it at the plan level
8a33c16 to
e92e9ad
Compare
|
Run Gluten Clickhouse CI on x86 |
e92e9ad to
1856d11
Compare
|
Run Gluten Clickhouse CI on x86 |
1856d11 to
2bb6797
Compare
|
Run Gluten Clickhouse CI on x86 |
2bb6797 to
501b454
Compare
|
Run Gluten Clickhouse CI on x86 |
501b454 to
5c0ae08
Compare
|
Run Gluten Clickhouse CI on x86 |
5c0ae08 to
d7ba2e9
Compare
|
Run Gluten Clickhouse CI on x86 |
d7ba2e9 to
7663b2f
Compare
|
Run Gluten Clickhouse CI on x86 |
|
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
|
This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks. |
What changes are proposed in this pull request?
This PR changes the SubstraitToVeloxPlanConverter to pass the table schema rather than just the schema of the columns we
read from the files to the HiveTableHandle.
This is a necessary prerequisite to supporting index based column resolution, whether that's reading from files using column
positions rather than names to map between the file schema and the table schema, or reading from file formats that do not
contain schema information like Text.
To do this I updated VeloxIteratorApi to set the file schema of LocalFilesNodes to the data schema of the Scan (when
present). This is similar to what's already done in the Iterator API for ClickHouse. I then parse that schema and pass it to the
SplitInfo in VeloxPlanConverter. Finally, I extract it from the SplitInfo in SubstraitToVeloxPlanConverter and pass it to the
HiveTableHandle constructor in place of the base schema.
This should not produce any noticeable effect for the existing code paths/file formats as the table schema is a superset of
the base schema and file columns are currently mapped to table columns exclusively by name.
How was this patch tested?
Ran the existing unit tests. This change should not change any existing behavior, but should enable future changes.